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Abstract 

We address the problem of the jomt statistical inference of phyloge- 
netic trees and multiple sequence alignments from unaligned molecular 
sequences. This problem is generally formulated in terms of string- 
valued evolutionary processes along the branches of a phylogenetic 
tree. The classical evolutionary process, the TKF91 model is a 
continuous-time Markov chain model comprised of insertion, deletion 
and substitution events. Unfortunately this model gives rise to an 
intractable computational problem — the computation of the marginal 
likelihood under the TKF91 model is exponential in the number of 
taxa [36]. In this work, we present a new stochastic process, the Pois- 
son Indel Process (PIP), in which the complexity of this computation 
is reduced to linear. The new model is closely related to the TKF91 
model, differing only in its treatment of insertions, but the new model 
has a global characterization as a Poisson process on the phylogeny. 
Standard results for Poisson processes allow key computations to be 
decoupled, which yields the favorable computational profile of infer- 
ence under the PIP model. We present illustrative experiments in 
which Bayesian inference under the PIP model is compared to sepa- 
rate inference of phylogenies and alignments. 



1 Introduction 



The field of pliylogenetic inference is being transformed by the rapid growth 
in availability of molecular sequence data. There is an urgent need for infer- 
ential procedures that can cope with data from large numbers of taxa and 
that can provide inferences for ancestral states and evolutionary parameters 
over increasingly large timespans. Existing procedures are often not scal- 
able along these dimensions and can be a bottleneck in analyses of modern 
molecular datasets. 

A key issue that renders phylogenetic inference difficult is that sequence 
data are generally not aligned a priori, having undergone evolutionary pro- 
cesses that involve insertions and deletions. Consider Figure [T| which depicts 
an evolutionary tree in which each node is associated with a string of nu- 
cleotides, and where the string evolves via insertion, deletion and substitution 
processes along each branch of the tree. Even if we consider evolutionary 
models that are stochastically independent along the branches of the tree 
(conditioning on ancestral states), the inferential problem of inferring evolu- 
tionary paths (conditioning on observed data at the leaves of the tree) does 
not generally decouple into independent computations along the branches of 
the tree. Rather, alignment decisions made throughout the tree can influence 
the posterior distribution on alignments along any branch. 

This issue has come to the fore in a line of research beginning in 1991 
with a seminal paper by Thorne, Kishino and Felsenstein [57|. In the "TKF91 
model," a simple continuous-time Markov chain (CTMC) provides a string- 
valued stochastic process along each branch of an evolutionary tree. This 
makes it possible to define joint probabilities on trees and alignments, and 
thereby obtain likelihoods and posterior distributions for statistical inference. 
A further important development has been the realization that the TKF91 
model can be represented as a hidden Markov model (HMM), and that gener- 
alizations to a broader class of string-valued stochastic processes with finite- 
dimensional marginals are therefore possible [201 ESI EHl EH [13 EZl E31 112] ■ 
This has the appeal that statistical inference under these processes (known 
as transducers) can be based on dynamic programming [21 EH |521 139] . Un- 
fortunately, however, despite some analytic simplification that is feasible in 
restricted cases [55] , the memory needed to represent the state space in these 
models is generally exponential in the number of leaves in the tree [TT] . 
Moreover, even in the simple TKF91 model, there does not appear to be 
additional structure in the state space that allows for simplification of the 
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Figure 1: A depiction of the evolution of a set of strings of nucleotides along 
the branches of a tree with leaves ^ = {vi,V2,V3} and root Q, where each 
string is subject to insertion, deletion and substitution processes. Stars de- 
note nucleotide insertion events, crosses, deletion events, and circles, substi- 
tution events. 



dynamic program. Indeed, the running time of the most sophisticated al- 
gorithm for computing marginals [36] depends on the number of homology 
linearizations, which is exponential in sparse alignments [51]. 

As a consequence of this unfavorable computational complexity, there 
has been extensive work on approximations, specifically on approximations 
to the joint marginal probability of a tree and an alignment, obtained by 
integrating over the derivation |33l [60] . A difficulty, however, is that these 
marginal probabilities play a role in tree inference procedures as the numer- 
ators and denominators of acceptance probabilities for Markov chain Monte 
Carlo algorithms. Loss of accuracy in these values can have large, uncon- 
trolled effects on the overall inference. A second approach is to consider 
joint models that are not obtained by marginalization of a joint continuous- 
time string-valued process. A range of combinatorial [501 ED [291 ISSl [53] and 
probabilistic [T^l I2S1 [13 1131 [11] models fall in this category. Although often 
inspired by continuous-time processes, obtaining a coherent and calibrated 
estimate of uncertainty in these models is difficult. 

A third possible response to the computational complexity of joint in- 
ference of trees and alignments is to retreat to methods that treat these 
problems separately. In particular, as is often done in practice, one can ob- 
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tain a Multiple Sequence Alignment (MSA) via any method (often based on 
a heuristically chosen "guide tree"), and then infer a tree based on the fixed 
alignment. This latter inferential process is generally based on the assump- 
tion that the columns of the alignment are independent, in which case the 
problem decouples into a simple recursion on the tree (the "Felsenstein" or 
"sum-product" recursion [12] )• Such an approach can introduce numerous 
artifacts, however, both in the inferred phylogeny [l3l HH |62], and in the 
inferred alignment [HI HQ] . 

It is also possible to iterate the solution of the MSA problem and the tree 
inference problem [301 EI] , which can be viewed as a heuristic methodology 
for attempting to perform joint inference. The drawbacks of these systems 
include a lack of theoretical understanding, the difficulty of getting calibrated 
confidence intervals, and over-alignment problems [STl 132] . 

Finally, other methods have focused on analyzing only pairs of sequences 
at a time [lH H?! |5T1 15]. While this approach can considerably simplify com- 
putation [121 [S] , it has the disadvantage that it is not based on an underlying 
joint posterior probability distribution. 

In the current paper we present a new approach to the joint probabilistic 
inference of trees and alignments. Our approach is based on a model that 
is closely related to TKF91, altering only the insertion process while leaving 
the deletion and substitution processes intact. Surprisingly, this relatively 
small change has a major mathematical consequence — under the new model 
evolutionary paths have an equivalent global description as a Poisson process 
on the phylogenetic tree. We are then able to exploit standard results for 
Poisson processes (notably, Poisson thinning) to obtain significant computa- 
tional leverage on the problem of computing the joint probability of a tree 
and an alignment. Indeed, under the new model this computation decouples 
in such a way that this joint probability can be obtained in linear time (linear 
in the number of taxa and the sequence length), rather than in exponential 
time as in TKF91. 

Our new model has two descriptions: the first as a local continuous-time 
Markov process that is closely related to the TKF91 model, and the second as 
a global Poisson process. We treat the latter as the fundamental description 
and refer to the new process as the Poisson Indel Process (PIP). The new 
description not only sheds light on computational issues, but it also opens up 
new ways to extend evolutionary models, allowing, for example, models that 
incorporate structural constraints and slipped-strand mispairing phenomena. 

The remainder of the paper is organized as follows. Section [2] provides 
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some basic background on the TKF91 model. In Section |3] presents the 
PIP model, in both its local and global formulations. Section |4] delves into 
the computational aspects of inference under the PIP model, describing the 
linear-time algorithm for computing the exact marginal density of an MSA 
and a tree. In Section [5] we present an empirical evaluation of the inference 
algorithm, and finally we present our conclusions in Section [6j 

2 Background 

We begin by giving a brief overview of the TKF91 model. Instead of following 
the standard treatment based on differential equations, we present a Doob- 
Gillespie view of the model [101 [13] that will be useful in our subsequent 
development. 

Let us assume that at some point in time t, a sequence has length n. 
In the TKF91 process, the sequence stays unchanged for a random interval 
of time At, and after this interval, a single random mutation (substitution, 
insertion or deletion) alters the sequence. This is achieved by defining a total 
of Sra + 1 independent exponential random variables, n of which correspond 
to deletion of a single character, n of which correspond to the mutation of 
a single character and n + 1 of which yield insertions after one of the n 
characters (including one "immortal" position at the leftmost position in the 
string). These 3n + l exponential random variables are simulated in parallel, 
and the value of the smallest of these random variables determines At. The 
index of the winner determines the nature of the event at time At (whether 
it is a substitution, deletion or insertion). 

The random variables corresponding to a deletion have exponential rate 
/^TKF while those corresponding to an insertion have exponential rate Atkf- 
If the event is a mutation, a multinomial random variable with parameters 
obtained from the substitution rate matrix 6 is drawn to determine the new 
value of the character. Finally, if an insertion occurs, a multinomial ran- 
dom variable is drawn to determine the value of the new character, with 
parameters generally taken from the stationary distribution of 6. 

This describes the evolution of a string of characters along a single edge of 
a phylogenetic tree. The extension to the entire phylogeny is straightforward; 
we simply visit the tree in preorder and apply the single-edge process to each 
edge. The distribution of the sequences at the root is generally assumed to 
be the stationary distribution of the single-edge process (conceptually, the 
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Figure 2: Notation used for describing the PIP. Given a phylogenetic tree r 
and a point x G r on that tree, is defined as the subtree rooted at x. 



distribution obtained along an infinitely-long edge). 

Although the TKF91 model is reversible, making the location of the root 
unidentifiable, it is useful to assume for simplicity that an arbitrary root has 
been picked, and we will make that assumption throughout. The likelihood 
is not affected by this arbitrary choice. 

3 The Poisson Indel Process 

In this section we introduce the Poisson Indel Process (PIP). This process 
has two descriptions, a local description which is closely related to the TKF91 
model, and a global description as a Poisson process. 

We require some additional notation (see Figure |2|. A phylogeny r will 
be viewed as a continuous set of points, and its topology will be denoted by 
{i^,(S'), where Y G t is equal to the finite subset containing the branching 
points, the leaves Jtf <Z Y and the root fl, and where S' is the set of edges. 
Parent nodes will be denoted by pa(w), for v & and the branch lengths by 
b{v), which is the length of the edge from pa(f ) to v. For any x G r (whether 
X is a branch point in Y, or an intermediate point on an edge), we write r^, 
for the rooted phylogenetic subtree of r rooted at x (dropping all points in 
the original tree that are not descendants of x). Finally, the set of characters 
(nucleotides or amino acids) will be denoted S. 
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3.1 Local description 

The stochastic process we propose has a local description that is very similar 
to the TKF91 process, the only change being that the insertion rate no longer 
depends on the sequence length. Therefore, instead of using 3n + l competing 
exponential random variables to determine the next event as in the TKF91 
model {n for substitutions, n + 1 for insertions, and n for deletions), we now 
have 2n + 1 variables {n for substitutions, 1 for insertion, with rate A, and 
n for deletion, each of rate /i). When an insertion occurs, its position is 
selected uniformly at randomj^ We assume that the process is initialized by 
sampling a Poisson-distributed number of characters, with parameter X/fi. 
Each character is sampled independently and identically according to the 
stationary distribution of 6. 

Note that if the sequence has length — 1 at some point in time, the 
distribution over the time and type of the next mutation is the same as in 
TKF91, by using the fact that the minimum of exponential variables with 
Aj is exponential, with rate equal to the sum of the Aj. However, in general 
the distributions are different. We discuss some of the biological aspects of 
these differences in Section [6] for now we focus on the computational and 
statistical aspects of the PIP model. 

3.2 Poisson process representation 

We turn to a seemingly very different process for associating character strings 
with a phylogeny. This process consists of two steps, the first involving 
insertions and the second involving deletions and substitutions. This two- 
step approach has antecedents in the literature on the probabilistic modeling 
of morphological or lexical characters [CT [1], but that literature did not 
address the string-valued processes that are our focus here. 

In the first step, depicted in Figure [3}\, a multiset of insertion points is 
sampled from a Poisson process defined on the phylogeny r [2T]. The rate 
measure for this Poisson process has atomic mass at the root of the tree; 
hence the need for multisets rather than simple point sets. We denote this 

^More precisely, assume there is a real number in the interval [0, 1] assigned to each 
character in the string in increasing order: < ri < r2 < • • • < r„ < 1. When an insertion 
occurs, sample a new real number r' uniformly in the interval [0, 1] and insert the new 
character at the unique position (with probability one) such that an increasing sequence 
of real numbers < • • • < r' < • • • < 1 is maintained. 
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Figure 3: Example of a PIP sample. Here S has two symbols, represented by 
red and green squares, and the absorbing deletion symbol e is represented in 
black. (A) A sample from a Poisson process on r. (B) Each sampled point 
corresponds to a rooted tree on which a CTMC path is sampled. (C) The 
alignments and sequences are obtained as a deterministic function of the first 
two steps. 



multiset of insertion points by X. 

In the second step, we visit the insertion points one at a time. The order of 
the visits of the insertions is sampled uniformly at random, (Xi, X2, . . . , Xj) ~ 
Perm(X). An insertion visit consists of two substeps. First, we extract the 
directed subtree rooted at the insertion location Xj. Examples of these sub- 
trees are shown in Figure [3^, left. Second, we simulate the fate of the inserted 
character along r^,. This is done via a substitution- deletion CTMC whose 
state space = S U {e} consists of the basic alphabet S augmented with 
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an empty string symbol e. As shown in Figure |3]B, right, the substitution- 
deletion CTMC yields paths along subtrees in which a single character either 
mutates or is deleted. The latter event, represented by £, is an absorbing 
state. 

We define a homology path Hi as the single-character history generated 
by a substitution-deletion CTMC along a phylogeny. If a point x G r is a 
descendant of the insertion Xj, Hi{x) is set to the state of the substitution- 
deletion CTMC at X. If X G r is not a descendant of Xj, we set Hi{x) to the 
absorbing symbol e. Thus, formally, a homology path Hi is a random map 
from any point on r to S^. 

Given a set of homology paths for each inserted character index the 
sequence at any point on the tree, x G r, is obtained as follows (see Figure|3]C, 
right). First, we construct a list of all the values taken by Hi{x) at the 
given point: (ifi(x), H2{x), . . . , Hj{x)). Second, we remove from the list any 
characters that are equal to the absorbing symbol e. The string obtained 
thereby is denoted by Y{x). The set of observed data comprises the values 
of Y at the leaves of the tree: y = {{v, Y{v)) : v G =5f }. 

We can also construct an MSA M from a set of homology paths (see 
Figure [3p, left). From each homology path Hi, we extract the characters 
at the leaves, arranging these characters in a column. Delete any column in 
which all of the characters are the character e. Arranging these columns in 
the order of the visits to the insertion points. The resulting matrix, whose 
entries range over the augmented alphabet Eg, is the MSA M. 

For a given rooted phylogenetic tree r, we will denote by Prim) the 
marginal probability that this process generates an MSA m, integrating over 
all homology paths, Pr{iTL) = P(M = m). For joint inference, we make the 
phylogenetic tree T random, with a distribution specified by a prior with 
density p{t). 

3.3 Characterization 

In this section we show that the local and the global descriptions of the PIP 
given in the previous two subsections are in fact alternative descriptions of 
the same string-valued stochastic process. In stating our theorem, we let ly 
denote the rate measure characterizing the insertion process in the global 
description, and let Q and vr denote the transition matrix and the initial 
distribution for the substitution-deletion CTMC. 
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Theorem 1 Let t be a phylogenetic tree with an arbitrary rooting, and let 
us denote the Lebesgue measure on r by the same symbol. For any insertion 
rate A > 0, deletion rate n > 0, and reversible substitution rate matrix 9, the 



local and global processes described in Sections \3.1\ and \37^ coincide if we set, 
for all a, a' G Sg.- 

dx) = A f r( dx) H — Sq( dx) 



if a = e 

ji if a' = e 

Oa,a' O.W., 



and set vr to be the quasi- stationary distribution of Q 

The proof is given in the appendix. 

Note that in the case of interest here, where the rate of deletion does not 
depend on the character being deleted, tTo- is equal to the entry of the sta- 
tionary distribution of 9 corresponding to a when a ^ e, and zero otherwise. 

The following result establishes some basic properties of the PIP model. 
Its proof can be found in the appendix. 

Proposition 2 For all fi, X > and reversible rate matrix 9, the PIP model 
is reversible, with a stationary length distribution given by a Poisson distri- 
bution with mean X/fi. 

The Poisson stationary length distribution represents a modeling advan- 
tage of PIP over TKF91, which has a geometrically distributed stationary 
distribution. Based on a study of protein-length distributions for the three 
domains of life [63], the Poisson distribution has been suggested [35] as a 
more adequate length distribution. 

From Proposition [2| we can also obtain an alternative reparameterization 
of the PIP model, in terms of asymptotic expected length rj = X/fi and indel 
intensity ( = X ■ fi. 



4 Computational Aspects 

We turn to a consideration of the computational consequences of the Poisson 
representation of the PIP model. We first consider how the Poisson process 
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characterization allows us to compute the marginal likelihood, pT{m), in 
hnear time; a significant improvement over methods based on the TKF91 
model. We then provide a brief discussion of the role that the marginal 
likelihood plays in inference. 



4.1 Computing the marginal likelihood 

To compute the marginal likelihood, ]9,-(m), we first condition on the number 
of homology paths, |X|. While the number of homology paths is random 
and unknown, we know that it can be no less than the number of columns 
\m\ in the postulated alignment m. We need to consider an unknown and 
unbounded number of birth events with no observed offsprings in the MSA, 
but as they are exchangeable, they can be marginalized analytically. This is 
done as follows: 

Pr{m) = E[P(M = m||X|)] 

oo ^ s 

= E P(|X| = n) ■ (^ ] ■ (Z(C0))"-H H Z{c), 

n=\m\ ^' cem 

where the first factor captures the probability of sampling n homology paths, 
the second, the number of ways to pick the |m| observed homology paths (the 
columns, which contain at least one descendent character at the leaves) out 
of the n paths, the factor Z{c) = P(C = c) is the likelihood of a single MSA 
column c, and C0 is a column with an absorbing deletion symbol at every leaf 
V e C0 = s. 

This expression can be simplified by introducing the function (f defined 
as follows for all 2; e (0, 1), A; e {1, 2, . . . }: 

<p{z,k) = exp{(2;- l)||i/ 

1 



ll^ll = ^ Ikll + 

where ||r|| is the normalization of the measure r, i.e., the sum of all the 
branch lengths in the topology. We show in the appendix that this yields the 
simple formula: 



Pr{m) = <^(^(c0), |m|) Yl Z{c). 
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The next step is to compute the hkehhood Z{c) of each individual ahgn- 
ment column c. We compute this by first conditioning on the random loca- 
tion of the tree at which the insertion point is located for column c. More 
precisely, we look at the most recent common ancestor V = v E Y oi the 
characters in c that are not equal to e. If f 7^ Q, this corresponds to the 
most recent endpoint of the edge e G <f where the insertion occurred. 

Computing the prior probability of the insertion location is greatly sim- 
plified by the fact that Xj||X| ~ u (see [25], Chapter 2.4), where u = 
denotes the probability obtained by normalizing the measure u. If we let 
Ly = F(y = v), we can write: 

^/({^]}) o.w. 

\\t\\ + \ l//i o.w. 

From the t's, the column likelihoods are computed as follows: 
F{C = c) = ^P(£^)P(C = c\£^) 

where fy is the output of a slight modification of Felsenstein's peeling re- 
cursion [T2j applied on the subtree rooted at v (the derivation for fy can be 
found in the appendix). 

Since computing the peeling recursion for one column takes time 0{\^\), 
we get a total running time of 0{\^\ ■ \m\), where |=Sf| is the number of 
observed taxa, and |m| is the number of columns in the alignment. 

4.2 Maximum likelihood 

In the case of maximum likelihood, the overall inference problem involves 
optimizing over the marginal likelihood: 

sup logp,-('^), 

where r ranges over phylogenies on the leaves and m ranges over the align- 
ments consistent with the observed sequences y. This optimization problem 
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can be approached using simulated annealing, where a candidate phylogeny 
and MSA pair (r', m') is proposed at each step z, and is accepted (meaning 
that it replaces the previous candidate (r, m)) according to a sequence of 
acceptance functions f^^\p,p') depending only on the marginal probabilities 
p = Pr{m),p' = pT-i{m'). Provided limj_>oo (p^p') = > p] sufficiently 
slowly, this algorithm converges to the maximum likelihood phylogeny and 
MSA [9]. 

4.3 Bayes estimators 

In order to define a Bayes estimator, one typically specifies a decision space 
D (for example the space of MSAs, or the space of multifurcating tree topolo- 
gies, or both), a projection into this space, (r, m) i— )■ p(r, m) G -D, and a loss 
function Z : D — )■ [0, oo) on D (for example, for tree topologies, the symmet- 
ric clade difference, or partition metric [1]; and for alignments, 1— the edge 
recall or Sum-of-Pairs (SP) score 

Given these objects, the optimal decision in the Bayesian framework (also 
known as the consensus tree or alignment), is obtained by minimizing over 
d G D the risk K[l{d, p(T, M))\y]. This expectation is intractable, so it is 
usually approximated with the empirical distribution of the output (r*^*\ m^*^) 
of an Markov chain Monte Carlo (MCMC) algorithm. 

Producing MCMC samples boils down to computing acceptance ratios of 
the form: 

p{T')pr'{rn') q{r',m'){T, m) 

p{T)pr{m) q^r,m){r',m')' 

for some proposal having density q with respect to a shared reference measure 
on r(^) xMiy). 

We thus see that for both maximum likelihood and joint Bayesian infer- 
ence of the MSA and phylogeny the key problem is that of computing the 
marginal likelihood ^^-(m). 

5 Experiments 

We implemented a system based on our model that performs joint Bayesian 
inference of phylogenies and alignments. We used this system to quantify 
the relative benefits of joint inference relative to separate inference under the 
PIP and TKF91 models; i.e., the benefits of inferring trees on accuracy of 
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the inferred MSA and the benefits of inferring MSAs on the accuracy of the 
inferred tree. 

We used synthetic data to assess the quahty of the tree reconstructions 
produced by PIP, compared to the reconstructions of PhyML 2.4.4, a widely- 
used platform for phylogenetic tree inference [H]. We also compared the 
inferred MSAs to those produced by Clustal 2.0.12 [18J, a popular MSA 
inference system. 

In this study, we explored four types of potential improvements: 

1. Resampling trees and MSAs increasing the quality of inferred MSAs, 
compared to resampling only MSAs. 

2. Resampling trees and MSAs increasing the quality of inferred trees, 
compared to resampling only trees. 

3. Resampling trees increasing the quality of inferred trees, compared to 
trees inferred by PhyML, and fixing the MSA to the one produced by 
Clustal. 

4. Resampling MSAs increasing the quality of inferred MSAs, compared 
to MSAs inferred by Clustal, and fixing the tree to the one produced 
by PhyML. 

The results are shown in Table [T] These experiments were based on 
100 replicas, each having 7 taxa at the leaves, a topology sampled from 
the uniform distribution, branch lengths sampled from rate 2 exponential 
distributions, indels generated from the PIP with parameters i] = 100, C = 1) 
and nucleotides sampled from the Kimura two-parameter model (K2P) . 

We observed improvements of all four types. Comparing Edge Fl relative 
improvements to Robinson-Foulds relative improvements, the relative addi- 
tional improvement of type (2) is larger (13%) than that of type (1) (3%). 
Overall (i.e., comparing the baselines to the joint system), the full improve- 
ments of both trees and MSAs are substantial: 43% Edge Fl improvement, 
and 27% Robinson-Foulds improvement. See Figure |4] for a summary of the 
relative improvements. 

We also tested our system on data generated from the TKF91 model 
instead of the PIP model. We used the same tree distribution and number 
of replicas as in the previous experiments, and the same generating TKF91 
parameters as [2U] . 
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Table 1: PIP results on simulated data 



X 

W 


Tree resampled? 
MSA resampled? 


No 
No 


Yes 
No 


No 
Yes 


Yes 
Yes 


MSAs 


Edge recall (SP) 
Edge Precision 
Edge Fl 


0.25 

0.22 
0.23 




0.22 
0.56 
0.31 


0.24 
0.58 
0.32 


Trees 


Partition Metric 
Robinson- Foulds 


0.24 
0.45 


0.22 
0.38 




0.19 
0.33 



Sampling tree+alignment 




Sampling alignment Sampling tree 




Figure 4: Relative improvements for enabling each component of the sampler. 
Arrows on the left are relative alignment improvements, and arrows on the 
right are relative tree improvements. 



We again observed improvements over the baselines, both in terms of 
MSA and tree quality. For MSAs, the relative improvement over the baseline 
was actually larger on the TKF91-generated data than on the PIP-generated 
data (47% versus 43%, as measured by Edge Fl improvement over Clustal), 
and lower but still substantial for phylogenetic trees (13% versus 27%, as 
measured by Robinson- Foulds improvement over PhyML). 

6 Discussion 

We have presented a novel string-valued evolutionary model that can be used 
for joint inference of phylogenies and multiple sequence alignments. As with 
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its predecessor, the TKF91 model, the new model can be used to capture 
the homology of characters evolving on a phylogenetic tree under insertion, 
deletion and substitution events. Its distinctive property relative to TKF91 
is that it permits a representation as a Poisson process on the tree. This 
representation has the consequence that the marginal likelihood of a tree 
and an alignment (marginalizing over ancestral states) can be computed in 
time linear in the number of taxa, rather than exponential as in the case of 
TKF91. 

Although the insertion process in TKF91 might be argued to be more 
realistic biologically than that of the PIP model, in that it allows the in- 
sertion rate to vary as the sequence length varies, in the common setting 
in which all of the sequences being aligned are of roughly similar lengths, 
this extra degree of freedom may be of limited value for inference. Indeed, 
in our experiments we saw that the PIP model can perform well even when 
data are generated from the TKF91 model. We might also note that there 
are biological processes in which insertions originate from a source that is 
extrinsic to the sequence (e.g., viruses or other genomic regions), in which 
case the constant-rate assumption of PIP may actually be preferred. 

It is also important to acknowledge, however, that neither TKF91 nor PIP 
are accurate representations of biology. Their use in phylogenetic modeling 
reflects the hope that the statistical inferences they permit — most notably 
taking into account the effect of indels on the tree topology — will nonetheless 
be useful as data accrue. This hope is more likely to be realized in larger 
datasets, motivating our goal of obtaining a method that scales to larger 
sets of species. But both models should also be viewed as jumping-off points 
for further modeling that is more faithful to the biology while retaining the 
inferential power of the basic models. For example, there has been significant 
work on extending TKF91 to models that capture the "long indels" that arise 
biologically but are not captured by the basic model [581 EH l37] . 

In this regard, we wish to note that the Poisson representation of the 
PIP model provides new avenues for extension that are not available within 
the TKF91 framework. In particular, the superposition property of Poisson 
processes makes it possible to combine the PIP model with other models 
that follow a Poisson law. For example, if the location X' of long indels, 
slipped-strand mispairing [23] or other non-local changes follow a Poisson 
point process, the union U = XVJX' oi the non-local changes with the point 
indels X provided by a PIP will also be distributed according to a Poisson 
process. Moreover, the thinning property of Poisson processes provides a 



16 



principled approach to inference for such superpositions. Indeed, an MCMC 
sampler for the superposition model can be constructed as follows: first, we 
can exploit the decomposition to analytically marginalize X (using the algo- 
rithm presented in this paper). Second, the other terms of the superposition 
and the sequences at these point in time can be represented explicitly as 
auxiliary variables. Since we have an efficient algorithm for computing the 
marginal likelihood, the auxiliary variables can be resampled easily. Note 
that designing an irreducible sampler without marginalizing X would be dif- 
ficult: integrating out X creates a bridge of positive probability between any 
pair of patterns of non-local changes. 

Finally, another avenue to improve PIP models is to make the insertion 
rate mean measure more realistic: instead of being uniform across the tree, 
it could be modeled using a prior distribution, hence forming a Cox process 
[7]. This would be most useful when the sequences under study have large 
length or indel intensity variations across sites and branches |54j . 
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A Proofs for the Main PIP Properties 

In this section, we prove Proposition [T] and [2j We begin by stating and 
proving two lemmas. 

Lemma 3 Let U ~ Unif(0,t) and W ~ Exp(/i) he independent for fixed 
t, ^ > 0. Then 

tjjj 

Proof By conditioning: 

F{W + U>t) = E[F{W + U > t\U)] 

"* exp(— xu) , 
- ax 



t 

1 — exp(— tyu) 

tjJL 



□ 



Lemma 4 Let tq denote a degenerate topology consisting of a root Vt con- 
nected to a single leaf vq by an edge of length t. Let Hi he a homology path as 



defined in Section 3.2, with r = Tq. For all x G r, define I{x) = {i : Hi{x) ^ 

N = \m)\ 

N' = \I{v,)\. 

Then N ~ Poi(A/yu) implies N' ~ Poi(A//x). 

Proof To prove the result, we decompose N and N' as follows (see Fig- 
ure |5|: 



iVi = 


m)\Hvo)\ 


N2 = 


\i{n)ni{vo)\ 


N3 = 


\iivo)\im 


N,= 


\i\imiivo)\ 


N = 


N, + N2 


N' = 


N2 + N3. 
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Figure 5: Notation used in the appendix. The horizontal hues denote the 
times where each character is present in the sequence. The vertical line on 
the left denotes the sequence at fl, and the vertical line on the right, the 
sequence at Vq. The sites are decomposed depending on whether they are 
present at each of two points fl, vq in tq. 



By the Coloring Theorem [25] . 

N2 ^ Foi{iy{{n})F{W > t)) , 

where is a rate /i exponential random variable, and u is as in the condition 
of Proposition [1} Therefore ~ Poi(Aexp(—t/i)//i). Similarly, 



iVg ~ Poi {u{T\{n})F(W + U >t)), 



where U ~ Unif(0,t), and therefore from Lemma|3| A^3 ~ Poi(A(l— exp(— 
It follows that: 

N' = N2 + iVs 



Poi (-e->^+-(l~e-^)) 



Poi 



'V 

which concludes the proof of the lemma. □ 
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We can now prove Proposition [Tj 

Proof In order to establish the equivalence, it is enough to show that for 
all edges e = (f — )■ v') in the tree, the following two properties hold: 

1. The distribution of the string length at the ancestral endpoint, 

is identical in the local and global descriptions: a Poisson distribution 
with rate X/ fi. 

2. The distribution of the number and locations of mutations that fall on 
e\{v,v'} are also identical in the local and global descriptions. 

We will enumerate the edges in the tree in preorder, using induction to 
establish these two hypotheses on this list of edges. 

In the base case, hypothesis 1 is satisfied by construction: the local de- 
scription is initialized with a Poi(A//i)-distributed number of characters, and 
in the global description, the intensity measure u of the Poisson process X 
assigns a point mass X/ to v = fl. 

To establish hypothesis 1 in the inductive case, let e' = {y" v) denote 
the parent edge. By hypothesis 1 on e', ~ Poi(A//i), therefore by 

Lemma |4] and hypothesis 2 on e', hypothesis 1 is satisfied on e as well. 

To estabhsh hypothesis 2, it is enough to show that for all x G e\{v,v'} 
the waiting time for each type of mutation given Y{x) is exponential, with 
rates: 

(a) A for insertion, 

(b) jj, ■ \ Y{x)\ for deletion, and 

(c) Xlcr^e substitutions to a' ^ e, where \s\a- denotes the 
number of characters of type cr G S in the string s G S*. 

Item (a) follows from the Poisson Interval Theorem [25]. Items (b) and (c) 
follow from the standard Doob- Gillespie characterization of CTMCs: if Xt 
is a CTMC with rate matrix Q = (qij) and Zij are independent exponential 
random variables with rate qij, then 

(A, J)\{Xo = i) = (minZij,argminZij), 
where A = inf{t ■.Xt^i},J = X^. □ 
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We now turn to Proposition [2] and establish reversibility. 



Proof Let h{ni,n2,n^,n4) = F{Ni = rii^i E {1,2,3,4}). Using reversibil- 
ity of 6', it is enough to show that h is invariant under the permutation (1 3); 
i.e., /i(ni,n2,n3,n4) = h{n3,n2,ni,n4). 

We have that n2, ns, 714) is equal to: 



F{Ni = ni,N + N' = n + n,N = n,N' = n) 

= F{N + N' = n + n')x 

F{N = n,N' = n'\N + N' = n + n')x 
W>{Ni = ni,N2 = n2\N = n)x 
W>{N3 = = n^lN' = n') 

= h{ni + ^2 + ^3 + n4)x 



where the precise form of the functions /i, /2 and /s is not important in this 
argument. By inspection, it is clear that h is invariant under the permutation 




(1 3). 



□ 
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B Proofs for the Likelihood Computation 



First, we show how the function ip, defined in Section |4| simphfies the com- 
putation of pT-{m): 

Pr{m) = E[P(M = m||X|)] 

= P(|X| = • (/^ ) ■ (^(C0))"-H J] Z{c) 

n=\m\ ^ ^ c6m 



|m|!(Z(c0))l™l (n-\m\)\ 

n=\m\ 



^^''HM\zic,))^'^^UcemZ{c)^i\W\\zic,)Y 



|m|!(Z(c0))IH ^ A;! 

= |m|!(Z(c0))H -pdlHI^M) 

= <^(^(C0), |m|) JJ Z(c). 

Next, we show how to compute = P(C = c\V = v) for all v E Y. Note 
first that /t, can be zero for some vertices. To see where and why, consider 
the set of leaves 5* that that have at least one observation H{v) 7^ e in the 
current column c. Then will be non-zero only for the vertices ancestral to 
some leaf in S. Let us call this set of vertices A (see Figure |6|. 

To cornpute fy on the remaining vertices, we introduce an intermediate 
variable, fy = P(C = c\V = v,H{v) 7^ e). This variable can be computed 
using the standard Felsenstein peeling recursion (dynamic programming) as 
follows: 



fv{o- 



l{c{v) = a) ifve^ 
Ea'es, exp{b{v)Q)^y n«;6chiid(^) /»(^') o.w. 



0-es 

From Lemma |3| we have an expression for the survival probability at 
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Figure 6: Given a set S of leaves v with H{v) ^ e, we define the set A of 
edges with nonzero modified Felsenstein peehng weight to be those ancestral 
to the leaves in S. 



given an insertion on the edge (pa(f) — )■ v): 

/3{v) = F{H{v) ^ e\V = v) 



b{v) jj, 

Finally, for c 7^ C0, we have: 

h = nc = c\v = v) 



and for c = c%\ 

fv 



E[P(C = c\V = v,H{v))] 
fv i{v = Q 

l[veA]P{v)f, O.W., 



fv ^ iiv = Q 

l + (3{v){f,-l) o.w. 



C Proposal distributions 

To perform full joint inference over trees and alignments using Markov chain 
Monte Carlo, several objects need to be resampled: the tree topology, the 
branch lengths, the MSA, and the parameters. 
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For trees and branch lengths, we use standard proposal mechanisms . 
Our MSA proposal is inspired by the proposal of [33], avoiding the mixing 
problems of auxiliary variables [20ll22| l3]. Our proposal distribution consists 
of two steps. First, we partition the leaves into two sets A, B. Given a current 
MSA mo, the support of the proposal is the set S of MS As m satisfying the 
following constraints: 

1. If e has both endpoints in A (or both in i?), then e (z m <^==^ e G mg. 

2. If e, e' have both endpoints in A (or both in B), then e -<m e' '^==^ 



The notation is based on the concept of posets over the columns (and 
edges) of an MSA |5l]. 

We propose an element m* E S with probability proportional to Hcem* ^i^)- 
The set S has exponential size, but can be sampled efficiently using standard 
pairwise alignment dynamic programming. A Metropolis-Hastings ratio is 
then computed to correct for (f. Note that the proposal induces an irreducible 
chain: one possible outcome of the move is to remove all links between two 
groups of sequences. The chain can therefore move to the empty MSA and 
then construct any MSA incrementally. 

For the parameters, we used multiplicative proposals in the (A, /i) param- 
eterization 128) . 



e -<, 



e'. 
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